# Cross-modal retrieval
**So400m Long** (fancyfeast) · Apache-2.0 · Text-to-Image · Transformers, English · 27 downloads · 3 likes
A vision-language model fine-tuned from SigLIP 2, with the maximum text length increased from 64 to 256 tokens.

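The longer text tower mainly changes how captions are tokenized. Below is a minimal sketch of text-to-image scoring with the Transformers SigLIP classes; the repository id `fancyfeast/so400m-long` is inferred from this listing and may not be the exact name.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Hypothetical repository id inferred from the listing; substitute the real one.
MODEL_ID = "fancyfeast/so400m-long"

model = AutoModel.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
captions = [
    "a long, highly detailed caption that would overflow the usual 64-token limit ...",
    "a short caption",
]

# SigLIP-style models are trained with fixed-length padding; this fine-tune
# accepts captions of up to 256 text tokens instead of 64.
inputs = processor(
    text=captions,
    images=image,
    padding="max_length",
    max_length=256,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# Higher values mean a better image-caption match (SigLIP uses a sigmoid, not a softmax).
print(outputs.logits_per_image)
```
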
**LLM2CLIP Openai L 14 224** (microsoft) · Apache-2.0 · Text-to-Image · 108 downloads · 5 likes
LLM2CLIP is an innovative approach that leverages large language models (LLMs) to unlock the potential of CLIP. It enhances text discriminability through a contrastive learning framework, overcoming the limitations of the original CLIP text encoder.

**LLM2CLIP Openai B 16** (microsoft) · Apache-2.0 · Text-to-Image · Safetensors · 1,154 downloads · 18 likes
LLM2CLIP is an innovative method that leverages large language models (LLMs) to extend CLIP's capabilities, enhancing text discriminability through a contrastive learning framework and significantly improving cross-modal task performance.

**LLM2CLIP EVA02 L 14 336** (microsoft) · Apache-2.0 · Text-to-Image · PyTorch · 75 downloads · 60 likes
LLM2CLIP is an innovative approach that enhances CLIP's visual representation capabilities through large language models (LLMs), significantly improving cross-modal task performance.

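Like the other dual-encoder models in this list, the LLM2CLIP checkpoints embed images and texts separately, and retrieval reduces to a cosine-similarity ranking over L2-normalized embeddings. The sketch below shows only that generic ranking step; it assumes you have already produced the embeddings with an encoder of your choice and is not the LLM2CLIP-specific loading code from the model cards.

```python
import numpy as np

def rank_images_by_text(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Return image indices sorted from best to worst match for one text query.

    text_emb:   (d,)   embedding of the query caption
    image_embs: (n, d) embeddings of the candidate images
    """
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

    similarities = image_embs @ text_emb  # (n,)
    return np.argsort(-similarities)      # highest similarity first

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
print(rank_images_by_text(rng.normal(size=512), rng.normal(size=(5, 512))))
```
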
**Safeclip Vit L 14** (aimagelab) · Text-to-Image · Transformers · 931 downloads · 3 likes
Safe-CLIP is an enhanced vision-language model based on CLIP, designed to mitigate risks associated with NSFW (Not Safe For Work) content in AI applications.

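Because Safe-CLIP keeps the CLIP architecture, it should load through the standard Transformers CLIP classes. A minimal sketch, assuming the repository id `aimagelab/safeclip_vit-l_14` inferred from this listing:

```python
from transformers import CLIPModel, CLIPProcessor

# Inferred repository id; verify it against the actual model card before use.
MODEL_ID = "aimagelab/safeclip_vit-l_14"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
# From here on, usage is identical to any other CLIP checkpoint in Transformers.
```
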
**Nllb Siglip Mrl Large** (visheratin) · Image-to-Text · 297 downloads · 14 likes
NLLB-SigLIP-MRL is a multilingual vision-language model that combines the text encoder from NLLB and the image encoder from SigLIP, supporting 201 languages from Flores-200.

**Nllb Siglip Mrl Base** (visheratin) · Image-to-Text · 352 downloads · 9 likes
A multilingual vision-language model combining the NLLB text encoder and the SigLIP image encoder, supporting 201 languages and multiple embedding dimensions.

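The "multiple embedding dimensions" in the MRL variants refer to Matryoshka-style training, where a prefix of the full embedding vector is itself a usable lower-dimensional embedding. The sketch below shows the generic truncate-and-renormalize step; the concrete dimensions (768 and 256 here) are placeholder assumptions, not values taken from the model card.

```python
import numpy as np

def shrink_embedding(full_emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding and renormalize."""
    small = full_emb[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

# Example: cut a 768-dimensional embedding down to 256 dimensions.
full = np.random.default_rng(0).normal(size=768)
print(shrink_embedding(full, 256).shape)  # (256,)
```
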
**Owlvit Tiny Non Contiguous Weight** (fxmarty) · MIT · Text-to-Image · Transformers · 337 downloads · 0 likes
OWL-ViT is an open-vocabulary object detection model based on a Vision Transformer, capable of detecting categories that are not present in the training data.

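The fxmarty checkpoint is a tiny testing artifact, but it exercises the same OWL-ViT detection API as the full-size models. A minimal sketch using the original `google/owlvit-base-patch32` weights, with free-form text queries acting as the open vocabulary:

```python
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")
queries = [["a traffic light", "a bicycle", "a fire hydrant"]]  # one query list per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(f"{queries[0][label.item()]}: {score:.2f} at {box.tolist()}")
```
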
**Nllb Clip Base Siglip** (visheratin) · Text-to-Image · 478 downloads · 1 like
NLLB-CLIP-SigLIP is a multilingual vision-language model that combines the text encoder from NLLB and the image encoder from SigLIP, supporting 201 languages.

**Nllb Clip Large Siglip** (visheratin) · Text-to-Image · 384 downloads · 5 likes
NLLB-CLIP-SigLIP is a multilingual vision-language model that combines the text encoder of the NLLB model and the image encoder of the SigLIP model, supporting 201 languages.

**Metaclip L14 400m** (facebook) · Text-to-Image · Transformers · 325 downloads · 3 likes
MetaCLIP is a vision-language model trained on CommonCrawl data to construct a shared image-text embedding space.

**Metaclip L14 Fullcc2.5b** (facebook) · Text-to-Image · Transformers · 172 downloads · 3 likes
MetaCLIP is a large-scale vision-language model trained on 2.5 billion image-text pairs from CommonCrawl (CC), and the accompanying work reveals the data curation methodology behind CLIP.

**Metaclip B16 400m** (facebook) · Text-to-Image · Transformers · 51 downloads · 1 like
MetaCLIP is a vision-language model trained on CommonCrawl data to construct a shared image-text embedding space.

**Metaclip B32 Fullcc2.5b** (facebook) · Text-to-Image · Transformers · 413 downloads · 7 likes
MetaCLIP is a vision-language model trained on 2.5 billion image-text pairs from CommonCrawl (CC) to construct a shared image-text embedding space.

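The MetaCLIP checkpoints follow the standard CLIP interface in Transformers, so zero-shot classification against the shared image-text embedding space looks the same as for any CLIP model. A minimal sketch, assuming the `facebook/metaclip-b32-fullcc2.5b` repository id from this listing:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "facebook/metaclip-b32-fullcc2.5b"
model = AutoModel.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```
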
**Nllb Clip Base Oc** (visheratin) · Text-to-Image · 371 downloads · 1 like
NLLB-CLIP is a multilingual vision-language model combining the NLLB text encoder with the CLIP image encoder, supporting 201 languages.

**Languagebind Audio** (LanguageBind) · MIT · Multimodal Alignment · Transformers · 271 downloads · 3 likes
LanguageBind is a language-centric multimodal pre-training method that extends video-language pre-training to N modalities through language semantic alignment, achieving high-performance multimodal understanding and alignment.

**CLIP ViT L 14 CommonPool.XL.clip S13b B90k** (laion) · MIT · Text-to-Image · 534 downloads · 1 like
A vision-language model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval.

**Altclip M18** (BAAI) · Text-to-Image · Transformers · 58 downloads · 5 likes
AltCLIP-m18 is a CLIP model supporting 18 languages for image-text matching tasks.

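Transformers ships dedicated AltCLIP classes, and since this entry carries the Transformers tag, the m18 checkpoint should load through them. The sketch below is written under that assumption rather than taken from the model card; the `BAAI/AltCLIP-m18` repository id is inferred from the listing.

```python
import torch
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor

# Assumes the m18 checkpoint exposes Transformers-compatible weights.
MODEL_ID = "BAAI/AltCLIP-m18"
model = AltCLIPModel.from_pretrained(MODEL_ID)
processor = AltCLIPProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
texts = ["una foto de un gato", "一张狗的照片"]  # multilingual candidate captions

inputs = processor(text=texts, images=image, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```
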
**Clip Fa Vision** (SajjadAyoubi) · Text-to-Image · Transformers · 43 downloads · 5 likes
CLIPfa is the Persian version of OpenAI's CLIP model, connecting Persian text and image representations through contrastive learning.

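This entry is only the image tower; CLIPfa keeps its text and image encoders in separate repositories. A sketch of pairing the two is given below; the `SajjadAyoubi/clip-fa-text` repository id and the matching embedding width are assumptions based on the naming pattern, not facts from this listing.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, CLIPVisionModel, RobertaModel

# Image tower from this listing; the text-tower id follows the same naming
# pattern and should be confirmed against the author's model cards.
vision_encoder = CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision")
image_processor = CLIPImageProcessor.from_pretrained("SajjadAyoubi/clip-fa-vision")
text_encoder = RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text")
tokenizer = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")

image_inputs = image_processor(Image.open("example.jpg").convert("RGB"), return_tensors="pt")
text_inputs = tokenizer(["یک عکس از یک گربه"], return_tensors="pt")  # "a photo of a cat"

with torch.no_grad():
    image_emb = vision_encoder(**image_inputs).pooler_output
    text_emb = text_encoder(**text_inputs).pooler_output

# Cosine similarity ranks image-text pairs (assumes both towers share one embedding width).
print(torch.nn.functional.cosine_similarity(image_emb, text_emb))
```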